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1. Introduction 

After the invention of lattice QCD in 1974 [1] it quickly became clear that (i) it had great 
potential to yield physically relevant nonperturbative results and (ii) enormous computational re¬ 
sources would be required to obtain them. Such resources were not readily available at the time. 
This led some of the physicists involved to design and build special supercomputers optimized for 
lattice QCD simulations. These activities started in the 1980s and are continuing until now. The 
resulting machines have been serving as work horses for large-scale lattice QCD simulations world¬ 
wide and were typically much more cost-efficient than commercially available machines. They also 
influenced the design of commercial supercomputers such as BlueGene. A probably incomplete 
list of such machines includes ACPMAPS, GFll, Fermi-256, QCDSP, QCDOC (all in the US), 
QCDPAX, CP-PACS, PACS-CS (Japan), as well as APEIOO, APEmille, apeNEXT, and QPACE 1 
(Europe), see also [2] for a review. 

However, in the past several years the situation has changed. Eirst, the development of spe¬ 
cialized processor chips has become too expensive for academic projects. Second, commercial 
supercomputers such as BlueGene scale well for lattice QCD and are reasonably cost- and energy- 
effective (although the development of truly scalable machines such as BlueGene or the K computer 
relies on enormous government funding that is not always forthcoming). Einally, standard compute 
clusters (typically with accelerators) are now readily available and can be used for capacity-type 
problems where strong scalability is not a must. 

Nevertheless, it still makes sense to pursue academic hardware-development activities by com¬ 
bining commercially available components in an innovative way, in particular regarding the com¬ 
munication between processors. This can result in machines that are more scalable and more cost¬ 
and energy-effective than commercial machines. The development activities typically involve in¬ 
dustry partners that also benefit from the innovative concepts originating on the academic side (“co¬ 
design”). In our most recent projects these partners were IBM (QPACE I [3]) and Intel (QPACE 2) 
as well as Eurotech (QPACE 1 and 2). Note that the current “custom” designs for lattice QCD are 
suitable for a broader application portfolio than they used to be in the past. Note also that hardware 
development is only part of the story and that system software and high-performance application 
codes are equally essential (and in fact absorb a large fraction of the development activities). 

In Sec. 2 we give an overview of the QPACE 2 project, which is based on the Knights Corner 
(KNC) version of the Intel Xeon Phi processor. We also discuss in some detail our experiences 
with this new “many-core” architecture and give recommendations for programming it. In Sec. 3 
we present a new solver code tailored to the KNC. This code is based on domain decomposition 
(DD), which is less bandwidth bound and more latency tolerant than other approaches, and thus 
leads to better scaling behavior [4]. In Sec. 4 we summarize and give an outlook to QPACE 3, 
which will be an upgrade of the QPACE 2 prototype system, based on the future Knights Eanding 
(KNE) version of the Xeon Phi. 

2. Overview of QPACE 2 

Between QPACE 1 and QPACE 2, the acronym QPACE changed its meaning from “QCD 
Parallel Computing on the Cell Processor” to “QCD Parallel Computing Engine”. The QPACE 2 
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Figure 1: Left: QPACE 2 node design, see text for details. Right: Sketch of the two-dimensional hyper¬ 
crossbar network (based on FDR Infiniband) of QPACE 2, here shown for 16 x 8 nodes as an example. Each 
dot represents a node with two IB ports that is connected to one IB switch each in the x- and y-directions. 
The switches, which here have 32 ports each, are indicated by the red and blue squares, respectively. 


project is funded by the German Research Foundation (DFG) and led by the University of Regens¬ 
burg in collaboration with Eurotech (Italy/Japan) and Intel, with additional contributions from the 
Jiilich Supercomputing Center and the University of Wuppertal. ^ 

2.1 Node design 

Like most supercomputers, QPACE 2 consists of many identical nodes connected by a net¬ 
work. The node design is shown in Eig. 1 (left). Eour Intel Xeon Phi 7120X processors (a.k.a. 
Knights Corner or KNC) are connected to a PCIe switch (PEX 8796 by PLX). Also connected to 
the switch are a low-power CPU (Intel Xeon E3-1230L v3) and a dual-port EDR Infiniband card 
(Mellanox Connect-IB). The CPU serves as the PCIe root complex. The KNCs as well as the IB 
card are PCIe endpoints. Peer-to-peer (P2P) communication between any pair of endpoints can 
take place via the switch. The rationale behind this node design is that a high-performance network 
is typically quite expensive. A “fat” node with several processing elements and cheap internal 
communications (here over PCIe) has a smaller surface-to-volume ratio and thus requires less net¬ 
work bandwidth per floating-point performance, which lowers the relative cost of the network. The 
number of KNCs and IB cards on the PCIe switch is determined by the number of lanes supported 
by commercially available switches and by the communication requirements within and outside of 
the node. We are using the largest available switch, which supports 96 lanes PCIe Gen3. Each 
of the KNCs has a 16-lane Gen2 interface (corresponding to a bandwidth of 8 GB/s), while both 
the CPU and the IB card have a 16-lane Gen3 interface (i.e., almost 16 GB/s each). The external 
IB bandwidth for two EDR ports is 13.6 GB/s. This balance of internal and external bandwidth 
is consistent with the communication requirements of Lattice QCD, see also Sec. 3. Each of the 
KNCs has 61 cores, a clock speed of 1.238 GHz, a peak performance of 1.2 TElop/s (double pre- 

*The development of QPACE 2 was initially pursued in collaboration with T-Platforms. Unfortunately, this collab¬ 
oration was terminated after T-Platforms was placed on the so-called Entity List of the US Department of Commerce in 
March 2013 and no longer had access to US technology, such as Intel Xeon Phi processors. This caused a significant 
delay for QPACE 2. Note that T-Platforms was removed from the Entity List in December 2013. 
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cision), and 16 GB of GDDR5 memory with a peak bandwidth of 352 GB/s. All of these numbers 
are nominal. Sustained numbers will be discussed in Sec. 2.5. 

One advantage of our node design is that the KNCs could in principle be replaced by GPUs, 
although in this case a stronger CPU would be advisable. 

2.2 Network 

The nodes are connected by an IB-based network with a hyper-crossbar topology. This topol¬ 
ogy was introduced by the CP-PACS collaboration [5]. For QPACE 2 we use a two-dimensional 
version as indicated in Fig. 1 (right). In general, if we have nodes with d ports each and switches 
with p ports each, a r/-dimensional hyper-crossbar has a maximum size of p‘^ nodes. We are using 
IB edge switches with 36 ports, p = 32 of which are used for connecting the nodes. Thus our maxi¬ 
mum partition size is 1024 nodes, corresponding to 4096 KNCs (which is sufficient for present-day 
lattice sizes and more than our budget can afford). The remaining 4 switch ports can be used for 
connecting the machine to a storage system and/or for connecting the switches in a torus (using the 
torus-2QoS routing engine in the OpenSM subnet manager). 

The advantage of a hyper-crossbar over a torus is that we have full connectivity in every 
single dimension with one switch hop, and all-to-all connectivity by going through at most one 
intermediate node. The advantage over a (fat) tree is lower cost. We note for completeness that a 
higher-dimensional application can be mapped to a lower-dimensional hyper-crossbar in a variety 
of ways. Typically this should be done such that the communication requirements are minimized. 

2.3 System design 

The components of a node are integrated in what we currently call a “brick”.^ The basis of a 
brick is a midplane, see Fig. 2 (left). This midplane accommodates the PCIe switch, a power con¬ 
nector and a number of power converters, and six slots for standard PCIe form factor cards, i.e., a 
CPU card (Fig. 2 middle), a Connect-IB card (Fig. 2 right), and four KNC cards (Fig. 3). The KNC 
card (by Intel) and the Connect-IB card (by Mellanox) are commercial components. The midplane 
and the CPU card were designed for QPACE 2 but can be reused for other projects/products. The 
CPU card also contains 16 GB of DDR3 memory, a platform controller hub (PCH), and a base¬ 
board management controller (BMC). It provides two Ethernet and several other interfaces such as 
USB, VGA, and SPI. One of the Ethernet interfaces is used by the BMC and the other by the CPU 
to form two separate Ethernet networks. The BMC network is used for low-level control of the 
nodes. The CPU network is used for booting the operating system and for login. Both networks 
are also used for system monitoring. 

The cooling concept of QPACE 2 is a novel idea based on roll-bond technology (see p. 253 
and Fig. 8.23 of [6] for an explanation of this technology). The concept is shown in Fig. 3 for 
a KNC card. The other cards as well as the midplane are also cooled in this way. Water flows 
through a roll-bond plate (made of aluminum) which is thermally coupled to the hot components 
via interposers (made of aluminum or copper) and thermal grease or thermal interface material. 

^This name is due to the obvious similarity of our mechanical design, see Fig. 5, with a brick used in the con¬ 
struction industry, but we should probably change it in the future to avoid confusion with the alternative meaning of 
“non-functioning electronic device”. 


4 




QPACE 2 and Domain Decomposition on the Intel Xeon Phi 


Tilo Wettig 



Figure 2: Some components of a QPACE 2 node (not to scale): Midplane (left), CPU card (middle), and 
Connect-IB card (right). The power cables on the midplane are needed for the KNCs. See text for details. 



Figure 3: QPACE 2 cooling concept, here for the KNC card. Left: The hot components of the card (middle) 
are thermally coupled to the roll-bond plate (right) via interposers (top and bottom). The copper plate is 
inserted in the hole of the interposer and interfaces with the KNC chip through thermal grease. The yellow 
pieces are thermal interface material. Right: Eully assembled water-cooled KNC card. The top interposer is 
coupled to the roll-bond plate with thermally conductive glue. The bottom interposer, which cools memory 
chips on the bottom of the card, is thermally coupled to the top interposer via aluminum surfaces (visible in 
Eig. 4 right) and screws. 


The advantage of the roll-bond technology is that it is cheap and widely available and that custom¬ 
shaped water channels are very easy to implement. In addition, the risk of leakages is minimized. 
With other technologies it is much harder or more expensive to create channels of arbitrary shape 
and to make sure that there are no leakages. Our design allows for cooling water temperatures of 
at least 50°C, which makes free cooling possible year-round. Thus no chillers are needed, which 
in our installation improves the overall energy efficiency of the machine significantly. 

Once the cooling solutions have been assembled, the six PCIe cards are plugged into the 
midplane, as shown in Fig. 4. The six roll-bond plates are connected to a mini-manifold with 
plastic tubes. The entire assembly is then put into a housing, see Fig. 5. The mini-manifold is 
visible in Fig. 5 (right). It is connected to the cooling-water circuit via drip-free quick-disconnect 
couplings. The lever visible in Fig. 5 (left) locks the power connector and the quick disconnects 
into place in the rack. 
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Figure 4: Left; The PCIe cards with their cooling assembly are plugged into the midplane. Here only one 
KNC card is shown. Right: One node (with all six cards liquid-cooled but without housing) is undergoing 
testing. 



Figure 5: Left; A fully assembled node in its housing. Several interfaces are provided on the front panel; 
dual FDR Inhniband, dual Ethernet, USB, VGA, and debug connector. Right: Back side of the brick with 
power connector, mini-manifold, and drip-free quick-disconnect couplings (male parts) for in- and outlet. 


We use a standard 19-inch rack of height 42U. The height of a brick is 3U, and we can put 
4 + 4 bricks in 3U (4 from the front of the rack, 4 from the back). This translates to 64 nodes, 
i.e., 256 KNCs, in 24U. The remaining space in the rack is used for the cooling-water distribution, 
power distribution, 3 Ethernet and 4 Infiniband switches, a management/login server, and cable 
management. The peak performance of a rack is about 310 TFlop/s (double precision). 

12V power is delivered to the nodes via massive power bars made of copper and small power 
backplanes, see Fig. 6. The latter also contain a DIP switch that is used to assign a unique location 
ID to each brick. Based on this location ID we then define IP addresses. The power bars are 
fed by industry-standard platinum power supply units (PSUs) that convert 230V AC to 12V DC 
at about 95% efficiency, see Fig. 7 (leff). Each PSU can deliver up to 2 kW. We subdivide the 
rack power distribution into 8 domains, with each domain powered by 6 PSUs serving 8 bricks 
in parallel. Assuming a typical power consumption of 1 kW per node, we are just on the border 
between 5 + 1 and 4 + 2 redundancy. A PSU control board designed in Regensburg monitors and 
controls the PSUs via PMBus, using a BeagleBone Black single-board computer that plugs into the 
master control board, see Fig. 7 (right). One PSU control board services 16 PSUs, hence we have 
3 control boards for 48 PSUs. The PSUs can deliver a maximum power of 96 kW per rack. First 
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Figure 6; Left: Power bars with power backplanes attached via cables. Also visible are the quick-disconnect 
couplings (female parts) for in- and outlet. Right: A brick (of height 3U) is being inserted into its slot. 



Figure 7; Left: The PSUs are connected to the power bars via power distribution boards (PDB). Each 
PDB holds two PSUs. Here, only a single PSU is shown. Right: The master PSU control board (right) 
accommodates a BeagleBone Black single-board computer. The other PSU control boards (e.g., the one on 
the left) run in slave mode. 

results suggest that synthetic benchmarks will consume up to 85 kW, whereas high-performance 
QCD code will run at about 60 kW. 

2.4 System software 

The BMC on the CPU card is running an embedded Linux version and boots from flash mem¬ 
ory as soon as power is turned on. It supports the typical functionalities of a board management 
controller, e.g., it can hold the other devices in the brick in reset or release them, it can monitor 
voltages as well as current and temperature sensors and act accordingly (e.g., by performing an 
emergency shutdown), or it can access the registers in the PCIe switch via an I^C bus. Note that 
our nodes are diskless. Once the CPU is released from reset it PXE-boots a minimal Linux image 
over the Ethernet network. The full Einux operating system (currently CentOS 7.0) uses an NES- 
mounted root file system. The KNCs are booted and controlled by the CPU using Intel’s KNC 
software stack MPSS. The KNCs support the Eustre file system of our main storage system, which 
is accessed over Infiniband. A variety of system monitoring tools are running either on the BMC 
(as described above) or on the CPU, which, for example, regularly checks the temperatures of all 
major devices (KNCs, CPU, Infiniband HCA, PCIe switch) as well as various error counters (ECC, 
PCIe, Infiniband). A front-end server is used to NES-export the operating system, to communicate 
with the nodes and monitor them, to log in to the machine, and to control the batch queues. 
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2.5 KNC architecture and its implications for the programmer 

Let us first describe some relevant details of the KNC architecture, focusing on the 7120X 
version used in QPACE 2. The processor chip contains 61 compute cores running at 1.238 GHz. 
The cores are connected by a bidirectional ring bus, which also interfaces to the GDDR5 memory 
controllers (for a total of 16 GB of on-board memory) and the PCIe I/O logic (16 lanes PCIe Gen2). 
Each core has a 512-bit wide SIMD unit on which it can perform fused-multiply-add instructions 
in either single or double precision. This translates to a peak performance of 1.2 TElop/s per KNC 
in double precision. There are four threads per core that have their own register set but shared 
execution units and caches. Each core has a private E2 cache of 512 kB. The E2 caches are kept 
coherent with a distributed tag directory, i.e., if core A needs data that are not in its E2 but in the 
E2 of core B, these data are automatically fetched from core B via the ring bus, rather than from 
memory. This is faster than memory access, but not by much in practice. 

While the KNC architecture is quite promising, the actual hardware implementation has some 
performance-relevant pitfalls that one should be aware of. Eor example, instructions from a given 
thread can only be issued every other cycle. Therefore we need at least two threads to issue instruc¬ 
tions in each cycle. Also, there is no El hardware prefetcher. Eurthermore, the cores execute in¬ 
structions in order, which implies that cache misses always stall execution of a thread for 10 ~ 100 
cycles. To hide these stalls we also need to use multiple threads. 

After initial theoretical performance analysis, we have performed microbenchmarks and ex¬ 
perimented heavily while implementing and optimizing our software. We believe to have arrived 
at a solid understanding of how to write high-performance code for the KNC. The purpose of this 
subsection is to describe some of the insights we have gained, starting from a single core and going 
to the full multi-node system. 

Eor some parts of the application, such as the stencil computation for the Wilson Dirac oper¬ 
ator, the small size of the El cache (32 kB) can become a limiting factor. There are two reasons 
for this. Eirst, while the El size has been kept constant, the size of the vector registers has been 
doubled compared to AVX (which in turn is double that of SSE). As a consequence the size of the 
working set is typically larger, which leads to more El capacity cache misses, i.e., data are more 
frequently evicted from El before they can be reused. Second, to fully use a core we need to use 
two threads, and to hide stalls due to cache misses we would often like to use three or four. Thus 
the El size per thread is less than 32 kB. The 8-way associativity of the El may further reduce 
the available space per thread since multiple threads compete for the 8 ways (unless they work on 
the same data). The lack of an El hardware prefetcher also means that software prefetching can 
be crucial. With well-placed El software prefetches^ we often get good results with two threads. 
Without optimized El prefetching typically three or four threads yield the best performance, at the 
cost of incurring the negative effects described above. 

The E2 cache is coherent but not shared between cores. This has several consequences. In a 
typical stencil computation each core would work on a distinct block of the lattice that needs to be 
held in cache. In addition, also neighboring sites (typically denoted as halos or ghost shells) need 
to be accessed, and thus be loaded into the cache. Without a shared cache this implies duplication 

^The icc compiler does well in this respect for simple code parts like loops, but manual prefetches are beneficial for 
more complex parts like the application of the Dirac operator. 
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at the L2 level: data for each lattice site may be present in the L2 caches of several cores. Simi¬ 
larly, the application code will by duplicated A^core ~ 60 times, since each core needs a copy of it. 
Finally, barriers are very expensive, up to 10,000 cycles for optimized implementations (see, e.g., 
an implementation by Intel as part of the QPhiX library [7]), while many standard implementations 
(such as in OpenMP and pthreads at the time of this writing) need well above 10,000 cycles. 

As described above, a substantial fraction of each core’s L2 is “wasted” for duplication of 
code and halos, thus reducing the fraction available for local data. Therefore we typically need a 
considerable amount of main-memory access. Stream benchmarks yield on the order of 170 GB/s 
for the sum of read and write bandwidth (i.e., less than 50% of the nominal 352 GB/s). This band¬ 
width can be sustained for simple cases like streaming access thanks to the L2 hardware prefetcher. 
For more complex cases software prefetching may become necessary. The general memory access 
pattern is of importance as well. To obtain a bandwidth close to 170 GB/s, all cores must read 
and/or write simultaneously and continuously, i.e., if each core reads and/or writes in bursts (with 
breaks in between) the resulting overall bandwidth is much lower. Furthermore, if each core reads 
streams from more than about 5 ~ 10 memory locations at a time, the overall bandwidth will not 
reach 170 GB/s, not even with manual L2 prefetching. This may be alleviated by modifying the 
data layout so that data needed at the same time are placed contiguously, e.g., by packing the gauge 
links for all four directions in one array instead of keeping a separate array for each direction. 

For moving data over the network we typically rely on MPI, which in our case can use Intel’s 
SCIF library (for intra-node communication) and/or the Infiniband HCA (for intra- and inter-node 
communication). SCIF shows good results in microbenchmai'ks: RDMA between two KNCs sus¬ 
tains a bidirectional bandwidth of about 6.5 GB/s for payloads of 64 kB and above, while PIO 
(programmed 10) sustains about 5 GB/s for payloads of 16 kB and above. These numbers are to be 
compared to the nominal value of 7.2 GB/s, which takes into account the overhead for the 256-Byte 
PCIe packages supported by the KNC [8]. The zero-size latencies are about 5.5 p.s for RDMA and 
about 2 ps for PIO. However, in practice we experience issues: when using SCIF RDMA calls, 
the kernel seems to heavily utilize 5 ~ 10 cores to do the data movement. These cores then be¬ 
come unusable for computation, since the application threads on these cores would slow down the 
whole machine.'^ Intel MPI via the HCA, which utilizes the DMA engine of the HCA rather than 
that of the KNC, does not show these issues. Another point is that the latency induced by the 
Intel MPI software stack is quite high. In microbenchmarks MVAPICH2-MIC seems to perform 
a bit better than Intel MPI (see also [9]), but at the time of this writing MVAPICH2-MIC always 
shows the SCIF issue explained above, even in cases where only the HCA should be used. In some 
performance-relevant parts we were successful in improving the performance by using Infiniband 
directly (i.e., without MPI) via the IB verbs library [10], at the cost of significantly increased com¬ 
plexity for the programmer. 

2.6 Application software 

We mainly use Chroma [11]. For the performance-relevant parts such as the DD-based solver 

“^One solution would be to leave these cores free, giving up 15% of the total performance, but even that is not enough: 
the kernel may occasionally schedule SCIF-related threads on one of the “compute” cores, again causing slowdown. In 
a big system this will of course happen very frequently. The situation could be improved by pinning these SCIF-related 
kernel threads to specific cores. This would be an interesting option for future work. 
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described in Sec. 3 we wrote separate libraries that can be linked during Chr'oma compilation. 
The Wuppertal adaptive algebraic multigrid code [12, 13] has been ported to the KNC, including 
vectorization of all performance-relevant parts. In this code a Krylov solver is preconditioned with 
a V-cycle and a smoother. For the latter we use our DD preconditioner. Other parts of our code suite, 
such as HMC, hadron spectra, and hadron distribution amplitudes are also being adapted to the 
KNC. Furthermore, we made some improvements to QDP-i-i- [11], such as memory management, 
optimization of existing threading, and some new threading. As for the actual implementation, 
the code is written in C-i-i-, the performance-relevant parts are vectorized using ice intrinsics or 
auto-vectorization by the compiler, threading uses OpenMP, and the multi-node implementation is 
currently based on Intel MPI or IB verbs as described in Sec. 2.5. 

3. Domain decomposition on the Knights Corner processor 

Most of the content of this section is a summary of work done in collaboration with the Intel 
Parallel Computing Labs in Santa Clara (USA) and Bangalore (India) and with JLAB (USA), the 
details of which are described in [14]. This work was motivated by the first implementation of a 
conjugate-gradient-based solver on the KNC [15] (see also [16, 17, 18] for other implementations). 
The strong-scaling behavior of that solver on a production system (TACC Stampede) turned out to 
be rather limited, the performance bottleneck being the bandwidths for memory access and inter¬ 
node communication. The obvious way to improve upon the strong-scaling behavior is to switch 
to another algorithm, based on domain decomposition, that moves less data between KNC and 
memory, and between different KNCs. 

In fact, this is a good example of how the implementation of high-performance application 
code on a new hardware architecture can proceed. First, one identifies the most suitable algorithm 
for the given hardware architecture. Second, one identifies the most suitable data layout for the 
given algorithm and hardware architecture. Only then should one proceed to optimize the code 
(including vectorization, threading, parallelization, etc.). 

3.1 Algorithm and implementation overview 

Let us briefly describe some details of our algorithm. For the Dirac operator we use Wilson- 
Clover. Our outer solver is flexible GMRES with deflated restarts [19]. Asa preconditioner we use 
the multiplicative Schwarz method (a.k.a. Schwarz alternating procedure), which was introduced 
in [20] and adapted to Lattice QCD in [4]. The main idea of the Schwarz method is to subdivide 
the lattice into domains. After a reordering of indices, the matrix to be inverted then consists of a 
block-diagonal part (corresponding to the interactions within the domains) and a remainder (cor¬ 
responding to the interactions between domains). Based on this splitting, a block-Jacobi iteration 
can be performed to approximate the inverse. If the domains are small enough to fit in cache, the 
ensuing inversion on the domains (i.e., the block-diagonal parts) can be done without off-chip data 
movement.^ Computation of the remainder does require communication but occurs less frequently. 
As a result, the overall algorithm needs less communication bandwidth, is more latency tolerant, 
and gives more cache reuse than the conjugate-gradient algorithm. 

^For the inversion on the domains we use the minimal-residual (MR) algorithm with even-odd preconditioning. 
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3.2 Data layout and vectorization 

For optimal performance it is necessary (i) to fully utilize the SIMD units, (ii) to choose the 
data layout such that the cache lines loaded into the KNC do not contain unnecessary data, and 
(iii) to avoid instruction overheads due to SIMD vector permutations. Typically one cannot satisfy 
all criteria simultaneously and tries to find a compromise that maximizes the overall performance. 
In our case, the best solution is a structure-of-array (SOA) format in which all 24 floating-point 
components of a spinor are stored in 24 separate registers and cache lines. This leads to “site 
fusing”, i.e., one 512-bit register contains data from 16 sites (in single precision). Since we have 
one small domain per core (see Sec. 3.3) we have to site-fuse in more than one dimension. For 
example, for an 8 x 4^ domain we site-fuse the x- and y-dimensions, with data from 4x4 sites in 
each register.^ 

In this scheme, the computation of hopping terms within domains can be done straightfor¬ 
wardly in the z- and t-directions for complete registers, while in the site-fused directions some 
permute and mask instructions are needed that lead to an underutilization of the SIMD units of 
12.5% in the x- and 25% in the y-direction. For the computation of hopping terms between do¬ 
mains, again the z- and t-directions are straightforward, while in the site-fused directions the cache 
lines that need to be loaded to access the neighbor’s boundary data contain many data that are not 
needed. We avoid this overhead by additionally storing the boundary data in an array-of-structure 
(AOS) format [14]. 

3.3 Cache management and prefetching 

Since the L2 is not shared we assign each domain to a single core. The cache size of 512 kB 
per core restricts the domain size to 8 x 4^ (in single precision). We actually implemented domains 
of size 4“* and 8x4^. Since the KNC can do up- and down-conversion, we store some domain 
data, i.e., the gauge links and clover matrices, in half precision. This reduces the working set (from 
456 kB to 312 kB per 8x4^ domain) and the memory-bandwidth requirements. To ensure stability, 
we keep the spinors in single precision. We observed that this optimization had no noticeable 
impact on the iteration count of the outer solver. 

As discussed in Sec. 2.5, the KNC has no LI hardware prefetcher, and an L2 hardware 
prefetcher only for streaming access. Because of the irregular code structure, compiler-generated 
software prefetches are often not good enough. It is therefore essential to manually insert LI and 
L2 prefetches using intrinsics. The fine-tuning of these prefetches turned out to be rather time- 
consuming. 

3.4 Intra-core threading 

As mentioned in Sec. 2.5, at least two threads per core are necessary to fully utilize the floating¬ 
point unit. In our implementation we assign threads to alternating time slices within a domain. 
There are pros and cons for using two or four threads per core. Two threads suffer from more stalls 
since LI or L2 misses cannot be hidden by other threads. Four thr'eads suffer from more LI conflict 
misses because our working set exhausts the LI size. We measured the performance for both cases 
and found no significant differences. 

®Our notation is always x Ly x x L, i.e., 8x4^ means = 8 and Ly = Lj- = L = 4. 
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3.5 Inter-core parallelization 

When parallelizing over the cores of the KNC, synchronization between cores is only neces¬ 
sary after an MR block solve, which implies that the cost of a barrier (which can be quite large on 
the KNC, see Sec. 2.5) has little impact on the overall performance. 

A simple but very performance-relevant issue is load balancing. Standard lattice sizes are 
typically powers of 2 and thus do not map naturally to 60 or 61 cores. As a result, some of the 
cores are unused at least part of the time (“load imbalance”)- A simple solution would be to 
include factors of 3 and 5 in the lattice size, but this is only an option for new simulations where 
we can select the lattice size. Another option is to increase the local volume (per KNC) so that 
the number Adomain of domains assigned to a single KNC^ is much larger than the number Acore of 
cores per KNC. The average load is then given by the function xj ceil(.r) with a: = Adomain/2Acore 
[14], which has a saw-tooth-like behavior with the minima approaching 1 from below as Adomain 
increases. However, increasing the local volume is not an option in the strong-scaling region for 
which we are trying to optimize. The third option is a non-uniform distribution of the lattice over 
multiple KNCs (assuming a multi-node implementation as described below). Let us give a simple 
example to illustrate the idea, using a lattice size of 64^ x 128 and a domain size of 8 x 4^. If the 
lattice is uniformly distributed over, say, 1024 KNCs ina4x4x8x8 layout, we have Adomain = 64 
and thus a load of 53%.^ In that case each KNC contains 16 sites in the t-direction. Alternatively, 
we could split the t-direction non-uniformly as 128 = 4 • 28 -f 16 so that we need only 5 (instead 
of 8) KNCs for the t-direction. This reduces the number of KNCs from 1024 to 640 and increases 
the average load to (4-56-1- 32)/(5 • 60) = 85%. The time to solution will not decrease (and may 
in fact increase slightly because the boundaries to be communicated will be larger), but the cost to 
solution will decrease by almost 40%. 

3.6 Multi-node implementation 

For the parallelization over many KNCs using MPI we have several options. For example, 
each thread could issue its own MPI calls. However, this option has two disadvantages. First, the 
overhead of many simultaneous MPI calls is high in typical MPI implementations. Second, the 
packets generated by each thread are too small to efficiently utilize the network. Therefore, a better 
option is to combine the surface data of all domains and to communicate them using only a single 
thread. This option also has a disadvantage, i.e., the need for explicit on-chip synchronization, but 
in practice it performs better than the other option. 

As in any parallelization effort, it is essential to overlap computation and communication in 
order to hide the communication latencies. The standard method is to divide the local volume into 
interior and surface and to work on the interior while waiting for the surface communication. For 
our domain-decomposition approach this does not work for typical parameters since most domains 
would be on the surface, i.e., we have only a small interior, or none at all. Instead, we devised 
a new method, see Fig. 8. In the left part we first show a splitting in a single direction, here t. 
Computations are done on each time-slice in the order indicated by the numbers. Boundary data 

^Note that the number of domains that can be processed independently is given by ^domain = Aiomainf2. The factor 
of 2 is due to the checkerboarding in the multiplicative Schwarz method. 

*Here, we use Aore = 60 since Intel configures the Linux kernel to run on the last core. 
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Figure 8: Illustration of our method to hide communication latencies. Each little box represents a domain. 
Left; Splitting only in the f-direction. Middle; Splitting in t- and z-direction. Right; Linear representation 
of the scheme in the middle. See text for details. 

can be sent after (1) has finished. These data are only needed for (1) in the next iteration. Thus, the 
communication of (1) overlaps with the computation of (2)-(4). However, this only works for t, but 
not for the other directions because their boundaries can only be sent after (4) has finished. A better 
scheme is shown in the middle and right parts. Here, we split in the t- and z-directions (in practice 
we split in all directions, and the same principle applies). Again, the order of the computations is 
indicated by the numbers, while the order of the communications is indicated by letters. As before, 
(a) occurs after (1), but in the z-direction we now send boundary data when the computation on 
half of the boundary has finished, i.e., (b) occurs after (2). The other half of the boundary is sent 
at the end, i.e., (c) occurs after (5). The first half is needed at the start of the next iteration, which 
means that (b) overlaps with (3)-(5). The second half is needed before (4) of the next iteration so 
that (c) then overlaps with (l)-(3). 

3.7 Results 

In this subsection we discuss the performance of our code for three cases; single core, single 
KNC, and many KNCs. Since we are interested in realistic lattice sizes, for which the lattice is 
distributed over many KNCs, the relevant parts of the code are MR iteration and DD preconditioner 
for the single-core and single-KNC cases, and full solver for the many-KNC case. 

We first present some theoretical single-core performance estimates for the Wilson-Clover op¬ 
erator, all in single precision. We assume that our domains fit in cache so that we are not memory- 
bandwidth bound. Of the 1848 flops/site in Wilson-Clover, 64% are fused multiply-adds, which 
gives an upper performance limit of 82%. Instruction overheads due to masking, shuffles and per¬ 
mutes as well as imperfect instruction pairing reduce the theoretical limit to 56% or 22 Gflop/s/core, 
see [14] for details. Our measured performance for the MR iteration is about 13.3 Gflop/s/core. 
According to VTune (Intel’s performance-analysis tool) this performance loss is mainly due to 
stalls because of outstanding LI prefetches. Since the optimal number of MR iterations in terms 
of time-to-solution is only 4 ~ 5, the Schwarz method is not completely dominated by MR, i.e., 
other parts contribute noticeably, and the overall performance of the Schwarz method drops to 9.5 
Gflops/s/core. Table 1 shows the impact of our optimization efforts on the performance numbers. 

Let us now discuss the many-core performance of the DD preconditioner when parallelizing 
over 60 cores of the KNC, see Fig. 9. We clearly observe the load-balancing issue discussed 
in Sec. 3.5. As long as the local volume and thus the number of domains is large enough, the 
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MR iteration 
single half 

DD method 
single half 

no software prefetching 

6.1 

8.9 

4.6 

6.6 

El prefetches 

10.4 

13.3 

6.5 

8.7 

E1-I-E2 prefetches 

10.2 

13.3 

7.1 

9.5 


Table 1: Single-core performance in Gflop/s (single precision) for the 7120X. See text for details. 



Figure 9: On-chip strong-scaling of the DD preconditioner (with 5 MR iterations and 16 Schwarz iterations). 
The first two volumes are chosen such that ndomain is divisible by 60, i.e., on 60 cores we have 100% load. 
The other volume corresponds to distributing a 48^ x 64 lattice over 64 KNCs, with a load of 90% on 60 
cores. 

effect is small, but once A^domain gets close to A^core it becomes serious. Apart from this effect the 
scaling is almost linear, because (i) during the MR inversion the cores run independently from their 
respective caches, i.e., memory bandwidth does not become a bottleneck, and (ii) the other (i.e., 
non-MR) parts of the DD method, which do require synchronization, are sub-dominant. 

Since the full QPACE 2 machine is not yet available, we performed multi-node benchmarks 
of the complete solver on up to 1024 nodes of the TACC Stampede cluster, which uses the 7 HOP 
version of the KNC. Due to the limited space we do not give a detailed discussion here but refer 
to [14] for details. As an example, we briefly discuss the results for a 64^ x 128 lattice. In the 
strong-scaling case relevant for HMC, the non-DD solver of [15] levels off at about 128 KNCs, 
while our DD solver still scales roughly linearly up to the maximum partition size of 1024 KNCs, 
with a time-to-solution gain of about a factor of 5. In the weak-scaling (or minimum-cost) case 
relevant for data analysis, our DD solver outperforms the non-DD solver by about a factor of 2 ~ 3. 

4. Summary and outlook 

We have presented the architecture of QPACE 2, a massively parallel and very densely packed 
system of KNC processors coupled by a two-layer network based on PCIe and EDR Infiniband. 
Two complete bricks have been running synthetic benchmai'ks as well as lattice QCD simulations 
continuously for over a month without errors. The full machine is currently (Eebruary 2015) being 
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assembled and tested in Regensburg and will enter production mode soon. High-performance 
solver code based on domain decomposition as well as application code is already available, see 
Sec. 2.6. The plan is to make most of our codes publicly available in the near future. 

The design of QPACE 2 allows for an upgrade to the next version of the Xeon Phi processor, 
code-named Knights Landing (KNL). Compared to the KNC, this processor will have three times 
the floating-point performance at the same or even less power, better cores (with out-of-order exe¬ 
cution and hardware prefetcher), high-bandwidth memory integrated on the package, a PCIe Gen3 
interface, and in some versions an integrated network controller (Omni-Path). A follow-up project, 
QPACE 3, will use the KNL processor and have a target peak performance of about 4 Pflop/s in 
double precision. The plan is to install this system at the Jiilich Supercomputing Center in 2016. 

We thank the German Research Eoundation (DEG) for the funding that made this project possi¬ 
ble, the machine shop and the electronics lab of the Regensburg University Physics Department for 
their contributions to the design, and T-Platforms for their support at an early stage of the project. 
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